Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

CNV-47438: ipam, virt: graduate persistent ips feature gate to GA #2058

Merged

Conversation

maiqueb
Copy link
Contributor

@maiqueb maiqueb commented Oct 9, 2024

This PR graduates the PersistentIPsForVirtualization feature gate to GA.

Copy link
Contributor

openshift-ci bot commented Oct 9, 2024

Hello @maiqueb! Some important instructions when contributing to openshift/api:
API design plays an important part in the user experience of OpenShift and as such API PRs are subject to a high level of scrutiny to ensure they follow our best practices. If you haven't already done so, please review the OpenShift API Conventions and ensure that your proposed changes are compliant. Following these conventions will help expedite the api review process for your PR.

@openshift-ci openshift-ci bot added the size/S Denotes a PR that changes 10-29 lines, ignoring generated files. label Oct 9, 2024
@openshift-ci openshift-ci bot requested review from bparees and deads2k October 9, 2024 13:55
@openshift-merge-robot openshift-merge-robot added the needs-rebase Indicates a PR cannot be merged because it has merge conflicts with HEAD. label Oct 27, 2024
@maiqueb maiqueb force-pushed the graduate-persistent-ips-api branch from 69d8740 to 8717515 Compare November 4, 2024 15:02
@openshift-merge-robot openshift-merge-robot removed the needs-rebase Indicates a PR cannot be merged because it has merge conflicts with HEAD. label Nov 4, 2024
@maiqueb
Copy link
Contributor Author

maiqueb commented Nov 14, 2024

/test verify

@maiqueb
Copy link
Contributor Author

maiqueb commented Nov 26, 2024

@JoelSpeed o/

We have added periodic lanes to trigger the test for this feature gate in this PR, which we plan on GAing for 4.18.

The tests are quite thorough, and were added in this other PR.

Unfortunately, while the tests themselves pass, it seems the monitor tests fail constantly due to the monitor tests. We have been trying to get this sorted out, but everytime we look, another component of OpenShift Virt throws a new event / alarm / fails to adhere to conformance testing - OpenShift virt uses their own conformance testing framework, and the hypershift lanes we have in origin do not run monitor tests afaiu.

This is a list of items we have gone through in the last weeks:

As a result, we plan on having the OpenShift Virt QE manually sign-off the feature is stable enough.

This is a heads-up so you know our intentions. Do you foresee any issues with this path ?

How does the process look like for the QE signoff ? Just setting qe-approved label ? Who overrides the failing verify lane ?

@JoelSpeed
Copy link
Contributor

Looking at prow I can only see two runs of the periodics you've added, that, doesn't seem right?

Both have failed and seem to be complaining that there are no tests to run?

We normally ask to see test results in sippy, could you see if you can find any test data in sippy for the tests you've added? TRT may be able to help you if you can't work this out

@JoelSpeed
Copy link
Contributor

Unfortunately, while the tests themselves pass, it seems the monitor tests fail constantly due to the monitor tests.

I can't see any evidence of this, please link

@maiqueb
Copy link
Contributor Author

maiqueb commented Nov 26, 2024

Unfortunately, while the tests themselves pass, it seems the monitor tests fail constantly due to the monitor tests.

I can't see any evidence of this, please link

Surely; if you open the Tests Passed in the periodic 4.18 execution and search for PersistentIPsForVirtualization you'll see all of them have passed.

@maiqueb
Copy link
Contributor Author

maiqueb commented Nov 26, 2024

Looking at prow I can only see two runs of the periodics you've added, that, doesn't seem right?

Both have failed and seem to be complaining that there are no tests to run?

I'd say this is on 4.19, where I am running the cluster without TP feature set.

We normally ask to see test results in sippy, could you see if you can find any test data in sippy for the tests you've added? TRT may be able to help you if you can't work this out

@JoelSpeed
Copy link
Contributor

Surely; if you open the Tests Passed in the periodic 4.18 execution and search for PersistentIPsForVirtualization you'll see all of them have passed.

Ahh I see now, right, so these tests are passing, but don't show up in sippy, which means we can't actually track or visualise them. I've started some threads in slack in #forum-ocp-release-oversight. I'd like to understand why this isn't in sippy, once we have it in sippy, we can visualise your tests, and, looking at the results, it may actually satisfy the requirements to pass the verify naturally

@maiqueb
Copy link
Contributor Author

maiqueb commented Nov 27, 2024

Surely; if you open the Tests Passed in the periodic 4.18 execution and search for PersistentIPsForVirtualization you'll see all of them have passed.

Ahh I see now, right, so these tests are passing, but don't show up in sippy, which means we can't actually track or visualise them. I've started some threads in slack in #forum-ocp-release-oversight. I'd like to understand why this isn't in sippy, once we have it in sippy, we can visualise your tests, and, looking at the results, it may actually satisfy the requirements to pass the verify naturally

Oh nice. Hopefully :D

@maiqueb maiqueb changed the title ipam, virt: graduate persistent ips feature gate to GA CNV-47438: ipam, virt: graduate persistent ips feature gate to GA Nov 27, 2024
@openshift-ci-robot openshift-ci-robot added the jira/valid-reference Indicates that this PR references a valid Jira ticket of any type. label Nov 27, 2024
@openshift-ci-robot
Copy link

openshift-ci-robot commented Nov 27, 2024

@maiqueb: This pull request references CNV-47438 which is a valid jira issue.

Warning: The referenced jira issue has an invalid target version for the target branch this PR targets: expected the task to target the "4.19.0" version, but no target version was set.

In response to this:

This PR graduates the PersistentIPsForVirtualization feature gate to GA.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the openshift-eng/jira-lifecycle-plugin repository.

@maiqueb
Copy link
Contributor Author

maiqueb commented Nov 27, 2024

/jira refresh

@openshift-ci-robot
Copy link

openshift-ci-robot commented Nov 27, 2024

@maiqueb: This pull request references CNV-47438 which is a valid jira issue.

Warning: The referenced jira issue has an invalid target version for the target branch this PR targets: expected the task to target the "4.19.0" version, but no target version was set.

In response to this:

/jira refresh

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the openshift-eng/jira-lifecycle-plugin repository.

@yossisegev
Copy link

yossisegev commented Dec 3, 2024

Update from CNV QE:
I managed to verify UDN persistentIP (on CNV VMs) for both migration and VM restart scenarios.
Verification was executed on an 4.18 PSI (virtual) cluster.

The scenario I ran for verifying persistent IP after restart:

  1. I created 2 VMs with primary UDN interfaces.
  2. On vm-a I ran an iperf3 server, which listens on a specific port:
    [fedora@vm-a ~]$ iperf3 -s -p 2345
    -----------------------------------------------------------
    Server listening on 2345 (test 1)
    -----------------------------------------------------------
  3. On vm-b I ran an iperf3 client, to maintain a connection to vm-a primary interface IP (via the same port, of course):
    [fedora@vm-b ~]$ iperf3 -c 192.168.88.4 -p 2345 -t 0
    Connecting to host 192.168.88.4, port 2345
    This connection succeeded.
  4. I restarted vm-a using virtctl restart vm-a
  5. When vm-a came up again, I
    a. Verified eth0 still has the same IP address.
    b. Re-ran the iperf3 server and client on the VMs, and let the client connect to the same IP address (and port) as before the restart.
    Once again - a successful connection was observed in both sides.

For migration I ran almost the same scenario, with the difference of stopping the iperf3 session in both client and server before migrating the VM, and restarted it (using the same persistent IP address) after the migration was completed.
This scenario passed successfully as well.

@JoelSpeed
Copy link
Contributor

/test verify

@deads2k
Copy link
Contributor

deads2k commented Dec 4, 2024

/hold

If we're facing SCC pinning failures on components, that's a very serious problem that in 4.19 can cause non-functional clusters with very high bandwidth costs as components attempt to create pods with unexpected SCCs that don't conform to PSA. This happens in customer environments with custom SCCs, so passing CI doesn't mean that the bug is "safe". It just means it is latent. This has happened in the past and was extremely costly. Get PRs open to fix the SCC pinning annotations.

@openshift-ci openshift-ci bot added the do-not-merge/hold Indicates that a PR should not merge because someone has issued a /hold command. label Dec 4, 2024
@deads2k
Copy link
Contributor

deads2k commented Dec 4, 2024

While the desire is to GA the feature in 4.18, if it breaks on 4.19, we've simply created a problem for ourselves. Let's get a PR up adding the annotations for the failing workloads in https://prow.ci.openshift.org/view/gs/test-platform-results/pr-logs/pull/openshift_release/58873/rehearse-58873-pull-ci-openshift-ovn-kubernetes-release-4.19-e2e-aws-ovn-virt-techpreview/1864239266813448192

@deads2k
Copy link
Contributor

deads2k commented Dec 4, 2024

We've got the ball rolling on these violations with kubevirt/cluster-network-addons-operator#1931 and more will be coming. These changes are important and lack of this pinning carries some risk in 4.18 and substantial risk in 4.19. Given 4.18 timeframes and a commitment to address the problem in 4.19, we can promote this feature once the rest of the testing functions.

/hold cancel

@openshift-ci openshift-ci bot removed the do-not-merge/hold Indicates that a PR should not merge because someone has issued a /hold command. label Dec 4, 2024
@JoelSpeed
Copy link
Contributor

/retest

@JoelSpeed
Copy link
Contributor

The periodic this morning failed on a persistent IPs test, can you please investigate and let me know here why this failed this time? https://prow.ci.openshift.org/view/gs/test-platform-results/logs/periodic-ci-openshift-ovn-kubernetes-release-4.19-periodics-e2e-aws-ovn-virt-techpreview-periodic/1864460089788731392

@maiqueb
Copy link
Contributor Author

maiqueb commented Dec 5, 2024

The periodic this morning failed on a persistent IPs test, can you please investigate and let me know here why this failed this time? https://prow.ci.openshift.org/view/gs/test-platform-results/logs/periodic-ci-openshift-ovn-kubernetes-release-4.19-periodics-e2e-aws-ovn-virt-techpreview-periodic/1864460089788731392

The TL;DR; is we're hitting this bug: https://issues.redhat.com/browse/OCPBUGS-42609 on the primary UDN feature.

The failing test is using the following parameters:

  • provision network via NAD
  • primary UDN

We create a server pod - e2e-test-network-segmentation-e2e-jljp9/testpod-ip-10-0-38-150.ec2.internal - which is supposed to have 2 network attachments:

  • default cluster network
  • primary UDN (which was provisioned using the NAD)

From the CNI result (which we can see on the OVN-K logs) we can see that only the default cluster network attachment was added:

Z I1205 01:17:19.211554    4193 cni.go:323] [e2e-test-network-segmentation-e2e-jljp9/testpod-ip-10-0-38-150.ec2.internal 9984a1dc2f16e647acde86605d219479cd7ce0c9b25b2f49cd8cb96b9ec48c80 network default NAD default] ADD finished CNI request [e2e-test-network-segmentation-e2e-jljp9/testpod-ip-10-0-38-150.ec2.internal 9984a1dc2f16e647acde86605d219479cd7ce0c9b25b2f49cd8cb96b9ec48c80 network default NAD default], result "{\"interfaces\":[{\"name\":\"9984a1dc2f16e64\",\"mac\":\"36:5e:57:2f:79:5b\"},{\"name\":\"eth0\",\"mac\":\"0a:58:0a:82:00:71\",\"sandbox\":\"/var/run/netns/907fbf44-3cc3-4c97-9ef1-78411ed8fa08\"}],\"ips\":[{\"interface\":1,\"address\":\"10.130.0.113/23\",\"gateway\":\"10.130.0.1\"}],\"dns\":{}}", err <nil>

The VM is trying to reach the server pod using the primary UDN, which was not plumbed into the pod.

As indicated in the bug, this race can happen.

I think I'll add some code to the tests after provisioning the NAD / UDN, and before provisioning any workload to assure that an OVN logical entity for the UDN was created. This means that the NAD/UDN was processed, and that while the network itself might not be ready, it will eventually be.

A Sleep could also do, but that sucks. I don't know for how long to sleep.

FWIW - the network segmentation feature gate is also prone to this race, and I expect to see this happening on their lanes as well.

@openshift-ci openshift-ci bot added the do-not-merge/hold Indicates that a PR should not merge because someone has issued a /hold command. label Dec 5, 2024
@deads2k
Copy link
Contributor

deads2k commented Dec 5, 2024

/hold cancel

it's an issue with UDN enabled, but with something called a secondary network it seems to work reliably? It sure would be nice if everything worked at the same time....

@openshift-ci openshift-ci bot removed the do-not-merge/hold Indicates that a PR should not merge because someone has issued a /hold command. label Dec 5, 2024
@maiqueb
Copy link
Contributor Author

maiqueb commented Dec 12, 2024

/test verify

@JoelSpeed
Copy link
Contributor

/payload-aggregate periodic-ci-openshift-ovn-kubernetes-release-4.19-periodics-e2e-aws-ovn-virt-techpreview-periodic 10

Copy link
Contributor

openshift-ci bot commented Dec 12, 2024

@JoelSpeed: trigger 1 job(s) for the /payload-(with-prs|job|aggregate|job-with-prs|aggregate-with-prs) command

  • periodic-ci-openshift-ovn-kubernetes-release-4.19-periodics-e2e-aws-ovn-virt-techpreview-periodic

See details on https://pr-payload-tests.ci.openshift.org/runs/ci/b1380e20-b875-11ef-9100-1cf11bd6ba9c-0

@JoelSpeed
Copy link
Contributor

@JoelSpeed
Copy link
Contributor

/payload-aggregate periodic-ci-openshift-ovn-kubernetes-release-4.19-periodics-e2e-aws-ovn-virt-periodic 10

Copy link
Contributor

openshift-ci bot commented Jan 8, 2025

@JoelSpeed: trigger 1 job(s) for the /payload-(with-prs|job|aggregate|job-with-prs|aggregate-with-prs) command

  • periodic-ci-openshift-ovn-kubernetes-release-4.19-periodics-e2e-aws-ovn-virt-periodic

See details on https://pr-payload-tests.ci.openshift.org/runs/ci/a5b3c130-cd55-11ef-9453-12bb6b8577d4-0

@maiqueb maiqueb force-pushed the graduate-persistent-ips-api branch from 8717515 to b701a1f Compare January 8, 2025 08:43
@JoelSpeed
Copy link
Contributor

/payload-aggregate periodic-ci-openshift-ovn-kubernetes-release-4.19-periodics-e2e-aws-ovn-virt-periodic 10

Copy link
Contributor

openshift-ci bot commented Jan 8, 2025

@JoelSpeed: trigger 1 job(s) for the /payload-(with-prs|job|aggregate|job-with-prs|aggregate-with-prs) command

  • periodic-ci-openshift-ovn-kubernetes-release-4.19-periodics-e2e-aws-ovn-virt-periodic

See details on https://pr-payload-tests.ci.openshift.org/runs/ci/37cb4550-cd9e-11ef-850c-8b451ba1c95d-0

Copy link
Contributor

openshift-ci bot commented Jan 8, 2025

@maiqueb: The following test failed, say /retest to rerun all failed tests or /retest-required to rerun all mandatory failed tests:

Test name Commit Details Required Rerun command
ci/prow/okd-scos-e2e-aws-ovn b701a1f link false /test okd-scos-e2e-aws-ovn

Full PR test history. Your PR dashboard.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository. I understand the commands that are listed here.

@JoelSpeed
Copy link
Contributor

/payload-aggregate periodic-ci-openshift-ovn-kubernetes-release-4.19-periodics-e2e-aws-ovn-virt-periodic 10

Copy link
Contributor

openshift-ci bot commented Jan 10, 2025

@JoelSpeed: trigger 1 job(s) for the /payload-(with-prs|job|aggregate|job-with-prs|aggregate-with-prs) command

  • periodic-ci-openshift-ovn-kubernetes-release-4.19-periodics-e2e-aws-ovn-virt-periodic

See details on https://pr-payload-tests.ci.openshift.org/runs/ci/d5a5f9b0-cf37-11ef-9b28-47182ea2adf8-0

@maiqueb
Copy link
Contributor Author

maiqueb commented Jan 10, 2025

Seems the tests didn't fail in https://gcsweb-ci.apps.ci.l2s4.p1.openshiftapps.com/gcs/test-platform-results/logs/aggregator-periodic-ci-openshift-ovn-kubernetes-release-4.19-periodics-e2e-aws-ovn-virt-periodic/1877653489371320320/artifacts/release-analysis-prpqr-aggregator/openshift-release-analysis-prpqr-aggregator/artifacts/release-analysis-aggregator/job-run-summary.html ...

Most of the reported failures are from the missing SCC / cluster broken.

Would it make sense to temporarily add the openshift-cnv namespace to the list of namespaces that can miss the required-scc annotation ? I've pushed this PR for that a while back: openshift/origin#29341

@JoelSpeed
Copy link
Contributor

Based on the aggregate run and current sippy results, I can see that the parts of this feature that do not rely on the NetworkSegmentation feature gate are stable and pass consistently.

There is an outstanding IOU from the virt folks to fix their SCC annotations, but we are not blocking this feature on that.

/override ci/prow/verify

/lgtm

@openshift-ci openshift-ci bot added the lgtm Indicates that a PR is ready to be merged. label Jan 10, 2025
Copy link
Contributor

openshift-ci bot commented Jan 10, 2025

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: JoelSpeed, maiqueb

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@openshift-ci openshift-ci bot added the approved Indicates a PR has been approved by an approver from all required OWNERS files. label Jan 10, 2025
Copy link
Contributor

openshift-ci bot commented Jan 10, 2025

@JoelSpeed: Overrode contexts on behalf of JoelSpeed: ci/prow/verify

In response to this:

Based on the aggregate run and current sippy results, I can see that the parts of this feature that do not rely on the NetworkSegmentation feature gate are stable and pass consistently.

There is an outstanding IOU from the virt folks to fix their SCC annotations, but we are not blocking this feature on that.

/override ci/prow/verify

/lgtm

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository.

@maiqueb
Copy link
Contributor Author

maiqueb commented Jan 10, 2025

/label acknowledge-critical-fixes-only

@openshift-ci openshift-ci bot added the acknowledge-critical-fixes-only Indicates if the issuer of the label is OK with the policy. label Jan 10, 2025
@openshift-merge-bot openshift-merge-bot bot merged commit c1a063b into openshift:master Jan 10, 2025
20 of 21 checks passed
@maiqueb
Copy link
Contributor Author

maiqueb commented Jan 11, 2025

/cherry-pick release-4.18

@openshift-cherrypick-robot

@maiqueb: #2058 failed to apply on top of branch "release-4.18":

Applying: ipam, virt: graduate persistent ips feature gate to GA
Using index info to reconstruct a base tree...
M	features.md
M	features/features.go
M	payload-manifests/featuregates/featureGate-Hypershift-Default.yaml
M	payload-manifests/featuregates/featureGate-SelfManagedHA-Default.yaml
Falling back to patching base and 3-way merge...
Auto-merging payload-manifests/featuregates/featureGate-SelfManagedHA-Default.yaml
CONFLICT (content): Merge conflict in payload-manifests/featuregates/featureGate-SelfManagedHA-Default.yaml
Auto-merging payload-manifests/featuregates/featureGate-Hypershift-Default.yaml
CONFLICT (content): Merge conflict in payload-manifests/featuregates/featureGate-Hypershift-Default.yaml
Auto-merging features/features.go
Auto-merging features.md
CONFLICT (content): Merge conflict in features.md
error: Failed to merge in the changes.
hint: Use 'git am --show-current-patch=diff' to see the failed patch
hint: When you have resolved this problem, run "git am --continue".
hint: If you prefer to skip this patch, run "git am --skip" instead.
hint: To restore the original branch and stop patching, run "git am --abort".
hint: Disable this message with "git config advice.mergeConflict false"
Patch failed at 0001 ipam, virt: graduate persistent ips feature gate to GA

In response to this:

/cherry-pick release-4.18

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository.

@openshift-bot
Copy link

[ART PR BUILD NOTIFIER]

Distgit: ose-cluster-config-api
This PR has been included in build ose-cluster-config-api-container-v4.19.0-202501110011.p0.gc1a063b.assembly.stream.el9.
All builds following this will include this PR.

dfitzmau pushed a commit to dfitzmau/api that referenced this pull request Jan 14, 2025
…s-api

CNV-47438: ipam, virt: graduate persistent ips feature gate to GA
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
acknowledge-critical-fixes-only Indicates if the issuer of the label is OK with the policy. approved Indicates a PR has been approved by an approver from all required OWNERS files. jira/valid-reference Indicates that this PR references a valid Jira ticket of any type. lgtm Indicates that a PR is ready to be merged. size/S Denotes a PR that changes 10-29 lines, ignoring generated files.
Projects
None yet
Development

Successfully merging this pull request may close these issues.

8 participants